Renewable energy sources play an increasingly important role in the global energy mix, as the effort to reduce the environmental impact of energy production increases.
Out of all the renewable energy alternatives, wind energy is one of the most developed technologies worldwide. The U.S. Department of Energy has put together a guide to achieving operational efficiency using predictive maintenance practices.
Predictive maintenance uses sensor information and analysis methods to measure and predict degradation and future component capability. The idea behind predictive maintenance is that failure patterns are predictable: if component failure can be predicted accurately and the component is replaced before it fails, the costs of operation and maintenance will be much lower.
The sensors fitted across different machines involved in the process of energy generation collect data related to various environmental factors (temperature, humidity, wind speed, etc.) and additional features related to various parts of the wind turbine (gearbox, tower, blades, break, etc.).
“ReneWind” is a company working on improving the machinery/processes involved in the production of wind energy using machine learning, and it has collected sensor data on generator failures of wind turbines. They have shared a ciphered version of the data, as the data collected through sensors is confidential (the type of data collected varies between companies). The data has 40 predictors, with 40,000 observations in the training set and 10,000 in the test set.
The objective is to build various classification models, tune them, and find the best one that helps identify failures, so that a generator can be repaired before it fails/breaks and the overall maintenance cost of the generators can be brought down.
In the target variable, “1” should be considered as “failure” and “0” represents “no failure”.
The nature of predictions made by the classification model will translate as follows:
- True positives (TP) are failures correctly predicted by the model; these generators can be repaired before they break.
- False negatives (FN) are real failures the model misses; these generators will break down and must be replaced.
- False positives (FP) are false alarms; these generators only incur an inspection cost.
So, the maintenance cost associated with the model would be:
Maintenance cost = TP*(Repair cost) + FN*(Replacement cost) + FP*(Inspection cost)
where,
Replacement cost = $40,000
Repair cost = $15,000
Inspection cost = $5,000

Here the objective is to reduce the maintenance cost, so we want a metric whose optimization reduces the maintenance cost.
Minimum possible maintenance cost = Actual failures*(Repair cost) = (TP + FN)*(Repair cost)
Maintenance cost associated with the model = TP*(Repair cost) + FN*(Replacement cost) + FP*(Inspection cost)

So, we will try to maximize the ratio of the minimum possible maintenance cost to the maintenance cost associated with the model.
The value of this ratio will lie between 0 and 1; it will be 1 only when the maintenance cost associated with the model equals the minimum possible maintenance cost.
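As a quick sanity check on this metric, the ratio can be computed directly from hypothetical confusion-matrix counts. The helper below is a minimal sketch; the counts in the example are illustrative, not results from the models built later.

```python
# Costs from the problem statement, in dollars
REPAIR, REPLACEMENT, INSPECTION = 15_000, 40_000, 5_000


def cost_ratio(tp, fn, fp):
    """Ratio of the minimum possible maintenance cost to the model's cost."""
    min_cost = (tp + fn) * REPAIR  # every actual failure caught and repaired
    model_cost = tp * REPAIR + fn * REPLACEMENT + fp * INSPECTION
    return min_cost / model_cost


# e.g. 90 failures caught, 10 missed, 20 false alarms (hypothetical counts)
print(round(cost_ratio(90, 10, 20), 3))  # → 0.811
```

A perfect classifier (fn = fp = 0) gives a ratio of exactly 1; every missed failure or false alarm pulls it below 1.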
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# To tune model, get different metric scores and split data
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
)
# To impute missing values
from sklearn.impute import KNNImputer
# To build a logistic regression model
from sklearn.linear_model import LogisticRegression
# To oversample and undersample data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
# To suppress the warnings
import warnings
warnings.filterwarnings("ignore")
# This will help in making the Python code more structured automatically (good coding practice)
%reload_ext nb_black
# To use make_scorer for the custom metric Minimum_Vs_Model_cost
from sklearn.metrics import make_scorer
from sklearn import metrics
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
# To be used for tuning the model
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
# To be used for creating pipelines and personalizing them
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
# To be used for missing value imputation
from sklearn.impute import SimpleImputer
renew_train = pd.read_csv("Train.csv")
renew_test = pd.read_csv("Test.csv")
# Checking the number of rows and columns in the train dataset
renew_train.shape
(40000, 41)
# Checking the number of rows and columns in the test dataset
renew_test.shape
(10000, 41)
# Combine the train and test datasets for EDA
frames = [renew_train, renew_test]
renew_combined = pd.concat(frames)
renew_combined.shape
(50000, 41)
# let's create a copy of the data
data = renew_combined.copy()
# let's view the first 5 rows of the data
data.head()
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | ... | V32 | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | Target | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -4.464606 | -4.679129 | 3.101546 | 0.506130 | -0.221083 | -2.032511 | -2.910870 | 0.050714 | -1.522351 | 3.761892 | ... | 3.059700 | -1.690440 | 2.846296 | 2.235198 | 6.667486 | 0.443809 | -2.369169 | 2.950578 | -3.480324 | 0 |
| 1 | -2.909996 | -2.568662 | 4.109032 | 1.316672 | -1.620594 | -3.827212 | -1.616970 | 0.669006 | 0.387045 | 0.853814 | ... | -3.782686 | -6.823172 | 4.908562 | 0.481554 | 5.338051 | 2.381297 | -3.127756 | 3.527309 | -3.019581 | 0 |
| 2 | 4.283674 | 5.105381 | 6.092238 | 2.639922 | -1.041357 | 1.308419 | -1.876140 | -9.582412 | 3.469504 | 0.763395 | ... | -3.097934 | 2.690334 | -1.643048 | 7.566482 | -3.197647 | -3.495672 | 8.104779 | 0.562085 | -4.227426 | 0 |
| 3 | 3.365912 | 3.653381 | 0.909671 | -1.367528 | 0.332016 | 2.358938 | 0.732600 | -4.332135 | 0.565695 | -0.101080 | ... | -1.795474 | 3.032780 | -2.467514 | 1.894599 | -2.297780 | -1.731048 | 5.908837 | -0.386345 | 0.616242 | 0 |
| 4 | -3.831843 | -5.824444 | 0.634031 | -2.418815 | -1.773827 | 1.016824 | -2.098941 | -3.173204 | -2.081860 | 5.392621 | ... | -0.257101 | 0.803550 | 4.086219 | 2.292138 | 5.360850 | 0.351993 | 2.940021 | 3.839160 | -4.309402 | 0 |
5 rows × 41 columns
# let's view the last 5 rows of the data
data.tail()
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | ... | V32 | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | Target | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9995 | -2.846746 | -0.808828 | 0.475822 | -3.742892 | 3.009556 | 1.825422 | -3.675837 | -2.057177 | -0.021756 | -0.848866 | ... | 6.148098 | 6.500759 | -7.757896 | 1.956287 | 3.601460 | 0.691951 | -1.310080 | 0.907578 | 1.816915 | 0 |
| 9996 | -1.439081 | 3.414045 | -2.786811 | 9.378783 | -2.252089 | -2.817603 | 1.124661 | -1.287717 | 5.115467 | -3.412366 | ... | -6.005167 | -7.740605 | 7.213442 | -2.636287 | -8.584351 | 1.553226 | 2.756595 | -0.101038 | -6.162453 | 1 |
| 9997 | -1.703241 | 0.614650 | 6.220503 | -0.104132 | 0.955916 | -3.278706 | -1.633855 | -0.103936 | 1.388152 | -1.065622 | ... | -4.100352 | -5.949325 | 0.550372 | -1.573640 | 6.823936 | 2.139307 | -4.036164 | 3.436051 | 0.579249 | 0 |
| 9998 | -2.037301 | -4.068539 | 0.525798 | 0.598100 | -2.517844 | -4.180772 | 1.110016 | 5.443957 | -3.856603 | 2.524608 | ... | -0.520402 | -6.952070 | 7.476771 | 1.178398 | 3.912227 | 1.972555 | -2.149725 | 1.569903 | -2.306153 | 0 |
| 9999 | -0.603701 | 0.959550 | -0.720995 | 8.229574 | -1.815610 | -2.275547 | -2.574524 | -1.041479 | 4.129645 | -2.731288 | ... | 2.369776 | -1.062408 | 0.790772 | 4.951955 | -7.440825 | -0.069506 | -0.918083 | -2.291154 | -5.362891 | 0 |
5 rows × 41 columns
# let's check the data types of the columns in the dataset
data.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 50000 entries, 0 to 9999 Data columns (total 41 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 V1 49943 non-null float64 1 V2 49954 non-null float64 2 V3 50000 non-null float64 3 V4 50000 non-null float64 4 V5 50000 non-null float64 5 V6 50000 non-null float64 6 V7 50000 non-null float64 7 V8 50000 non-null float64 8 V9 50000 non-null float64 9 V10 50000 non-null float64 10 V11 50000 non-null float64 11 V12 50000 non-null float64 12 V13 50000 non-null float64 13 V14 50000 non-null float64 14 V15 50000 non-null float64 15 V16 50000 non-null float64 16 V17 50000 non-null float64 17 V18 50000 non-null float64 18 V19 50000 non-null float64 19 V20 50000 non-null float64 20 V21 50000 non-null float64 21 V22 50000 non-null float64 22 V23 50000 non-null float64 23 V24 50000 non-null float64 24 V25 50000 non-null float64 25 V26 50000 non-null float64 26 V27 50000 non-null float64 27 V28 50000 non-null float64 28 V29 50000 non-null float64 29 V30 50000 non-null float64 30 V31 50000 non-null float64 31 V32 50000 non-null float64 32 V33 50000 non-null float64 33 V34 50000 non-null float64 34 V35 50000 non-null float64 35 V36 50000 non-null float64 36 V37 50000 non-null float64 37 V38 50000 non-null float64 38 V39 50000 non-null float64 39 V40 50000 non-null float64 40 Target 50000 non-null int64 dtypes: float64(40), int64(1) memory usage: 16.0 MB
# let's check for duplicate values in the data
data.duplicated().sum()
0
# let's check for missing values in the data
round(data.isnull().sum() / data.isnull().count() * 100, 2)
V1 0.11 V2 0.09 V3 0.00 V4 0.00 V5 0.00 V6 0.00 V7 0.00 V8 0.00 V9 0.00 V10 0.00 V11 0.00 V12 0.00 V13 0.00 V14 0.00 V15 0.00 V16 0.00 V17 0.00 V18 0.00 V19 0.00 V20 0.00 V21 0.00 V22 0.00 V23 0.00 V24 0.00 V25 0.00 V26 0.00 V27 0.00 V28 0.00 V29 0.00 V30 0.00 V31 0.00 V32 0.00 V33 0.00 V34 0.00 V35 0.00 V36 0.00 V37 0.00 V38 0.00 V39 0.00 V40 0.00 Target 0.00 dtype: float64
# let's view the statistical summary of the numerical columns in the data
data.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| V1 | 49943.0 | -0.282464 | 3.447271 | -13.501880 | -2.743078 | -0.763193 | 1.841801 | 17.436981 |
| V2 | 49954.0 | 0.437483 | 3.143469 | -13.212051 | -1.647111 | 0.462847 | 2.535182 | 14.079073 |
| V3 | 50000.0 | 2.515492 | 3.404023 | -12.940616 | 0.209396 | 2.268042 | 4.600534 | 18.366477 |
| V4 | 50000.0 | -0.063615 | 3.442310 | -16.015417 | -2.354226 | -0.131692 | 2.147664 | 13.279712 |
| V5 | 50000.0 | -0.052622 | 2.106177 | -8.612973 | -1.524851 | -0.106190 | 1.341980 | 9.403469 |
| V6 | 50000.0 | -1.003470 | 2.037258 | -10.227147 | -2.367461 | -1.007965 | 0.369929 | 7.065470 |
| V7 | 50000.0 | -0.895904 | 1.752564 | -8.205806 | -2.036467 | -0.935790 | 0.202484 | 8.006091 |
| V8 | 50000.0 | -0.570273 | 3.307717 | -15.657561 | -2.662428 | -0.384405 | 1.710841 | 11.679495 |
| V9 | 50000.0 | -0.001041 | 2.165552 | -8.596313 | -1.492890 | -0.059086 | 1.434368 | 8.850720 |
| V10 | 50000.0 | 0.002395 | 2.180208 | -11.000790 | -1.385728 | 0.113946 | 1.495507 | 8.108472 |
| V11 | 50000.0 | -1.923108 | 3.114148 | -15.384183 | -3.947120 | -1.946113 | 0.086039 | 13.851834 |
| V12 | 50000.0 | 1.572600 | 2.912838 | -13.619304 | -0.440541 | 1.478565 | 3.535923 | 15.753586 |
| V13 | 50000.0 | 1.596705 | 2.864838 | -13.831903 | -0.193629 | 1.669809 | 3.475298 | 15.419616 |
| V14 | 50000.0 | -0.945480 | 1.793375 | -8.309443 | -2.163680 | -0.955735 | 0.264932 | 6.213289 |
| V15 | 50000.0 | -2.438899 | 3.337696 | -17.201998 | -4.447319 | -2.401744 | -0.397467 | 12.874679 |
| V16 | 50000.0 | -2.957676 | 4.224731 | -21.918711 | -5.641959 | -2.732180 | -0.115322 | 13.975843 |
| V17 | 50000.0 | -0.145129 | 3.348054 | -17.633947 | -2.239475 | -0.024522 | 2.066579 | 19.776592 |
| V18 | 50000.0 | 1.187108 | 2.582070 | -12.214016 | -0.402790 | 0.868388 | 2.562075 | 13.642235 |
| V19 | 50000.0 | 1.188688 | 3.399091 | -14.169635 | -1.052110 | 1.285345 | 3.512293 | 16.059004 |
| V20 | 50000.0 | 0.037343 | 3.680876 | -13.922659 | -2.431371 | 0.037406 | 2.516258 | 16.052339 |
| V21 | 50000.0 | -3.632203 | 3.561466 | -19.436404 | -5.943179 | -3.577670 | -1.287824 | 16.218317 |
| V22 | 50000.0 | 0.944643 | 1.640839 | -10.122095 | -0.106597 | 0.965023 | 2.015634 | 7.505291 |
| V23 | 50000.0 | -0.400133 | 4.054329 | -16.187510 | -3.127839 | -0.271121 | 2.428500 | 15.080172 |
| V24 | 50000.0 | 1.132375 | 3.913322 | -18.487811 | -1.503746 | 0.952362 | 3.556193 | 19.769376 |
| V25 | 50000.0 | 0.004513 | 2.023915 | -8.228266 | -1.367289 | 0.028290 | 1.403508 | 8.223389 |
| V26 | 50000.0 | 1.892158 | 3.421522 | -12.587902 | -0.306675 | 1.965322 | 4.165105 | 17.528193 |
| V27 | 50000.0 | -0.605063 | 4.394175 | -14.904939 | -3.678693 | -0.900538 | 2.215126 | 21.594552 |
| V28 | 50000.0 | -0.887399 | 1.925515 | -9.685082 | -2.188972 | -0.910628 | 0.380025 | 7.415659 |
| V29 | 50000.0 | -1.012274 | 2.673050 | -12.579469 | -2.801679 | -1.218713 | 0.595303 | 14.039466 |
| V30 | 50000.0 | -0.039610 | 3.032573 | -14.796047 | -1.919318 | 0.175408 | 2.031358 | 13.190889 |
| V31 | 50000.0 | 0.501403 | 3.476380 | -19.376732 | -1.793735 | 0.496174 | 2.776333 | 17.255090 |
| V32 | 50000.0 | 0.312029 | 5.500217 | -23.200866 | -3.403515 | 0.044506 | 3.784771 | 26.539391 |
| V33 | 50000.0 | 0.046879 | 3.569798 | -17.454014 | -2.245345 | -0.058136 | 2.241522 | 16.692486 |
| V34 | 50000.0 | -0.455852 | 3.182112 | -17.985094 | -2.118038 | -0.240990 | 1.442804 | 14.581285 |
| V35 | 50000.0 | 2.239536 | 2.923146 | -19.522334 | 0.338313 | 2.117117 | 4.051883 | 16.804859 |
| V36 | 50000.0 | 1.534024 | 3.814629 | -17.478949 | -0.928539 | 1.580589 | 4.009270 | 19.329576 |
| V37 | 50000.0 | -0.001614 | 1.779059 | -7.639952 | -1.267810 | -0.136808 | 1.165882 | 7.803278 |
| V38 | 50000.0 | -0.355488 | 3.970872 | -17.375002 | -3.014300 | -0.329640 | 2.298586 | 15.964053 |
| V39 | 50000.0 | 0.904015 | 1.746060 | -7.147175 | -0.255200 | 0.923673 | 2.076958 | 7.997832 |
| V40 | 50000.0 | -0.905036 | 3.000452 | -11.930259 | -2.960029 | -0.957148 | 1.090963 | 10.654265 |
| Target | 50000.0 | 0.054680 | 0.227357 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
# function to plot a boxplot and a histogram along the same scale
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined
    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12, 7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a star will indicate the mean value of the column
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)  # for histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # add median to the histogram
# build the list of column names V1..V40
cols = ["V" + str(n) for n in range(1, 41)]
print(cols)
['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'V29', 'V30', 'V31', 'V32', 'V33', 'V34', 'V35', 'V36', 'V37', 'V38', 'V39', 'V40']
# build the histogram and boxplot for all the predictor variables
for column in cols:
    histogram_boxplot(data, column)
# let us check the heatmap for the correlations
plt.figure(figsize=(60, 40))
sns.heatmap(data.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
# Checking for correlations > 0.5
corr = data.corr()
# corr[corr > 0.5].dropna(axis=1, how='all').replace(1., np.nan).dropna(how='all', axis=1).dropna(how='all', axis=0).apply(lambda x:x.dropna().to_dict() ,axis=1).to_dict()
corr[corr > 0.5].replace(1, np.nan)
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | ... | V32 | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | Target | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| V1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| V2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | 0.660713 | NaN | NaN | NaN |
| V3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.537108 | NaN | NaN |
| V4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| V5 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 0.621388 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| V6 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | 0.588739 | NaN | NaN | NaN | NaN | 0.631617 | NaN | NaN | NaN |
| V7 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| V8 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | 0.531095 | NaN | NaN | NaN | NaN |
| V9 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| V10 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | 0.512303 | NaN | 0.559117 | NaN | NaN | 0.563899 | NaN | NaN |
| V11 | NaN | NaN | NaN | NaN | NaN | 0.711065 | 0.531419 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| V12 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.679755 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | 0.549087 | NaN | NaN | NaN | NaN |
| V13 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| V14 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.549949 | NaN | NaN | ... | NaN | NaN | NaN | NaN | 0.548342 | NaN | NaN | NaN | NaN | NaN |
| V15 | NaN | NaN | NaN | NaN | NaN | NaN | 0.870349 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| V16 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.803850 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| V17 | NaN | NaN | NaN | 0.612700 | NaN | NaN | NaN | 0.510631 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| V18 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| V19 | NaN | NaN | NaN | 0.599953 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | 0.757301 | 0.552482 | NaN | NaN | NaN | NaN | NaN | NaN |
| V20 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | 0.504184 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| V21 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| V22 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| V23 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.720828 | NaN | NaN | ... | 0.642338 | NaN | NaN | NaN | NaN | 0.576491 | NaN | NaN | NaN | NaN |
| V24 | NaN | NaN | NaN | 0.513432 | 0.665766 | NaN | NaN | NaN | NaN | NaN | ... | 0.824683 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| V25 | 0.678074 | NaN | 0.605330 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| V26 | NaN | 0.786833 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| V27 | 0.690023 | NaN | 0.508570 | NaN | NaN | NaN | NaN | NaN | NaN | 0.506132 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.547408 | NaN | NaN |
| V28 | NaN | NaN | NaN | 0.669059 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | 0.561215 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| V29 | NaN | NaN | NaN | NaN | NaN | 0.583677 | NaN | NaN | NaN | NaN | ... | NaN | 0.600956 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| V30 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 0.516453 | 0.611962 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| V31 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | 0.638251 | NaN | NaN | NaN | NaN | NaN |
| V32 | NaN | NaN | NaN | NaN | 0.621388 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| V33 | NaN | NaN | NaN | NaN | NaN | 0.588739 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| V34 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.512303 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| V35 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| V36 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.559117 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.751514 | NaN | NaN |
| V37 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.531095 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| V38 | NaN | 0.660713 | NaN | NaN | NaN | 0.631617 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| V39 | NaN | NaN | 0.537108 | NaN | NaN | NaN | NaN | NaN | NaN | 0.563899 | ... | NaN | NaN | NaN | NaN | 0.751514 | NaN | NaN | NaN | NaN | NaN |
| V40 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Target | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
41 rows × 41 columns
### Function to plot distributions
def distribution_plot_wrt_target(data, predictor, target):
    fig, axs = plt.subplots(2, 2, figsize=(12, 10))
    target_uniq = data[target].unique()
    axs[0, 0].set_title("Distribution of predictor for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
    )
    axs[0, 1].set_title("Distribution of predictor for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
    )
    axs[1, 0].set_title("Boxplot w.r.t. target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")
    axs[1, 1].set_title("Boxplot (without outliers) w.r.t. target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )
    plt.tight_layout()
    plt.show()
# build the histogram and boxplot for all the predictor variables
# use the for loop to iterate through all the variables
for column in cols:
    distribution_plot_wrt_target(data, column, "Target")
data.isnull().sum()
V1 57 V2 46 V3 0 V4 0 V5 0 V6 0 V7 0 V8 0 V9 0 V10 0 V11 0 V12 0 V13 0 V14 0 V15 0 V16 0 V17 0 V18 0 V19 0 V20 0 V21 0 V22 0 V23 0 V24 0 V25 0 V26 0 V27 0 V28 0 V29 0 V30 0 V31 0 V32 0 V33 0 V34 0 V35 0 V36 0 V37 0 V38 0 V39 0 V40 0 Target 0 dtype: int64
imputer = KNNImputer(n_neighbors=5)
# defining a list with names of columns that will be used for imputation
reqd_col_for_impute = ["V1", "V2"]
# Fit and transform the train data
renew_train[reqd_col_for_impute] = imputer.fit_transform(
renew_train[reqd_col_for_impute]
)
# Transform the test data
renew_test[reqd_col_for_impute] = imputer.transform(renew_test[reqd_col_for_impute])
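To see what `KNNImputer` actually does here, a toy sketch (not the ReneWind data): each missing entry is replaced by the mean of that feature over the `n_neighbors` rows that are closest in the remaining features.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data: one missing value in the second feature
X = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 6.0], [4.0, 8.0]])

imp = KNNImputer(n_neighbors=2)
# The two rows nearest to [2.0, nan] (measured on the non-missing feature)
# are [1.0, 2.0] and [3.0, 6.0], so the NaN is filled with mean(2.0, 6.0) = 4.0
print(imp.fit_transform(X))
```

Fitting on the train set and only transforming the test set, as done above, keeps test information from leaking into the imputation.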
# Check if any values are missing after KNN imputation in the train dataset
renew_train.isnull().sum()
V1 0 V2 0 V3 0 V4 0 V5 0 V6 0 V7 0 V8 0 V9 0 V10 0 V11 0 V12 0 V13 0 V14 0 V15 0 V16 0 V17 0 V18 0 V19 0 V20 0 V21 0 V22 0 V23 0 V24 0 V25 0 V26 0 V27 0 V28 0 V29 0 V30 0 V31 0 V32 0 V33 0 V34 0 V35 0 V36 0 V37 0 V38 0 V39 0 V40 0 Target 0 dtype: int64
# Check if any values are missing after KNN imputation in the test dataset
renew_test.isnull().sum()
V1 0 V2 0 V3 0 V4 0 V5 0 V6 0 V7 0 V8 0 V9 0 V10 0 V11 0 V12 0 V13 0 V14 0 V15 0 V16 0 V17 0 V18 0 V19 0 V20 0 V21 0 V22 0 V23 0 V24 0 V25 0 V26 0 V27 0 V28 0 V29 0 V30 0 V31 0 V32 0 V33 0 V34 0 V35 0 V36 0 V37 0 V38 0 V39 0 V40 0 Target 0 dtype: int64
# Review first 5 rows again of the train dataset
renew_train.head()
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | ... | V32 | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | Target | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -4.464606 | -4.679129 | 3.101546 | 0.506130 | -0.221083 | -2.032511 | -2.910870 | 0.050714 | -1.522351 | 3.761892 | ... | 3.059700 | -1.690440 | 2.846296 | 2.235198 | 6.667486 | 0.443809 | -2.369169 | 2.950578 | -3.480324 | 0 |
| 1 | -2.909996 | -2.568662 | 4.109032 | 1.316672 | -1.620594 | -3.827212 | -1.616970 | 0.669006 | 0.387045 | 0.853814 | ... | -3.782686 | -6.823172 | 4.908562 | 0.481554 | 5.338051 | 2.381297 | -3.127756 | 3.527309 | -3.019581 | 0 |
| 2 | 4.283674 | 5.105381 | 6.092238 | 2.639922 | -1.041357 | 1.308419 | -1.876140 | -9.582412 | 3.469504 | 0.763395 | ... | -3.097934 | 2.690334 | -1.643048 | 7.566482 | -3.197647 | -3.495672 | 8.104779 | 0.562085 | -4.227426 | 0 |
| 3 | 3.365912 | 3.653381 | 0.909671 | -1.367528 | 0.332016 | 2.358938 | 0.732600 | -4.332135 | 0.565695 | -0.101080 | ... | -1.795474 | 3.032780 | -2.467514 | 1.894599 | -2.297780 | -1.731048 | 5.908837 | -0.386345 | 0.616242 | 0 |
| 4 | -3.831843 | -5.824444 | 0.634031 | -2.418815 | -1.773827 | 1.016824 | -2.098941 | -3.173204 | -2.081860 | 5.392621 | ... | -0.257101 | 0.803550 | 4.086219 | 2.292138 | 5.360850 | 0.351993 | 2.940021 | 3.839160 | -4.309402 | 0 |
5 rows × 41 columns
Minimum cost / Cost associated with the model

Let's create two functions to calculate the different metrics and the confusion matrix, so that we don't have to repeat the same code for each model.
# defining a function to compute different metrics to check the performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance
    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    # predicting using the independent variables
    pred = model.predict(predictors)
    cm = confusion_matrix(target, pred)  # compute the confusion matrix only once
    TP = cm[1, 1]
    FP = cm[0, 1]
    FN = cm[1, 0]
    Cost = TP * 15 + FP * 5 + FN * 40  # maintenance cost when using the model (in $1000s)
    Min_Cost = (
        TP + FN
    ) * 15  # minimum possible maintenance cost = actual positives * repair cost
    Percent = (
        Min_Cost / Cost
    )  # ratio of the minimum possible maintenance cost to the model's maintenance cost
    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score
    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "Accuracy": acc,
            "Recall": recall,
            "Precision": precision,
            "F1": f1,
            "Minimum_Vs_Model_cost": Percent,
        },
        index=[0],
    )
    return df_perf
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages
    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
Maintenance cost associated with the model = TP*(Repair cost) + FN*(Replacement cost) + FP*(Inspection cost)

Eventually, all 3 metrics will do the same work in the backend; the only difference will be in the scale of the metric values.
The metric defined in the next cell maximizes (minimum possible maintenance cost / maintenance cost associated with the model).
# defining the metric to be used for optimization and with cross-validation
def Minimum_Vs_Model_cost(y_train, y_pred):
    """
    We want the model to optimize the maintenance cost and reduce it to the lowest possible value.
    The lowest possible maintenance cost is achieved when every sample is predicted correctly.
    In that scenario, the maintenance cost is the total number of actual failures times the repair cost
    of one generator, which is given by (TP + FN) * 15 (i.e., the actual positives * 15).
    For any other scenario, the maintenance cost associated with the model is (TP * 15 + FP * 5 + FN * 40).
    We will use the ratio of these two maintenance costs as the cost function for our model.
    The greater the ratio, the lower the associated maintenance cost and the better the model.
    """
    cm = confusion_matrix(y_train, y_pred)
    TP = cm[1, 1]
    FP = cm[0, 1]
    FN = cm[1, 0]
    return ((TP + FN) * 15) / (TP * 15 + FP * 5 + FN * 40)


# A value of 0.80 here means that the minimum maintenance cost is 80% of the maintenance cost associated with the model.
# Since the minimum maintenance cost is constant for a given dataset, the model yields the least possible
# maintenance cost when the minimum cost reaches 100% of the maintenance cost associated with the model.
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(Minimum_Vs_Model_cost, greater_is_better=True)
# The higher the value, the lower the maintenance cost
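The resulting scorer plugs into anything in sklearn that accepts a `scoring` argument, e.g. `cross_val_score` or `GridSearchCV`. A self-contained sketch on synthetic imbalanced data; the dataset and model here are illustrative only, not the ReneWind setup.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, make_scorer
from sklearn.model_selection import cross_val_score


def minimum_vs_model_cost(y_true, y_pred):
    # Ratio of the minimum possible maintenance cost to the model's cost
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return ((tp + fn) * 15) / (tp * 15 + fp * 5 + fn * 40)


scorer = make_scorer(minimum_vs_model_cost, greater_is_better=True)

# Synthetic data with roughly 10% positives, mimicking the class imbalance
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=1)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, scoring=scorer, cv=5
)
print(scores.mean())  # average cost ratio across the 5 folds
```

Because the ratio's denominator always dominates its numerator, each fold's score stays in [0, 1], which makes models directly comparable across folds.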
X_temp = renew_train.drop(["Target"], axis=1)
y_temp = renew_train["Target"]
X_train, X_val, y_train, y_val = train_test_split(
X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)
print(X_train.shape, X_val.shape)
(30000, 40) (10000, 40)
lr = LogisticRegression(random_state=1)
lr.fit(X_train, y_train)
LogisticRegression(random_state=1)
scoring = "recall"
kfold = StratifiedKFold(
    n_splits=10, shuffle=True, random_state=1
)  # setting the number of splits equal to 10
cv_result_bfr = cross_val_score(
estimator=lr, X=X_train, y=y_train, scoring=scoring, cv=kfold
)
# Plotting boxplots for CV scores of model defined above
plt.boxplot(cv_result_bfr)
plt.show()
# Calculating different metrics on train set
log_reg_model_train_perf = model_performance_classification_sklearn(
lr, X_train, y_train
)
print("Training performance:")
log_reg_model_train_perf
Training performance:
| Accuracy | Recall | Precision | F1 | Minimum_Vs_Model_cost | |
|---|---|---|---|---|---|
| 0 | 0.967333 | 0.485976 | 0.853319 | 0.61927 | 0.53063 |
# Calculating different metrics on validation set
log_reg_model_val_perf = model_performance_classification_sklearn(lr, X_val, y_val)
print("Validation performance:")
log_reg_model_val_perf
Validation performance:
| Accuracy | Recall | Precision | F1 | Minimum_Vs_Model_cost | |
|---|---|---|---|---|---|
| 0 | 0.9661 | 0.462523 | 0.848993 | 0.598817 | 0.519962 |
# creating confusion matrix
confusion_matrix_sklearn(lr, X_val, y_val)
# Libraries to build decision tree classifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
model = DecisionTreeClassifier(
criterion="gini", class_weight="balanced", random_state=1
)
model.fit(X_train, y_train)
DecisionTreeClassifier(class_weight='balanced', random_state=1)
# Check the performance on training dataset
decision_tree_perf_train = model_performance_classification_sklearn(
model, X_train, y_train
)
print(decision_tree_perf_train)
   Accuracy  Recall  Precision   F1  Minimum_Vs_Model_cost
0       1.0     1.0        1.0  1.0                    1.0
# Check the performance on validation dataset
decision_tree_perf_val = model_performance_classification_sklearn(model, X_val, y_val)
decision_tree_perf_val
| | Accuracy | Recall | Precision | F1 | Minimum_Vs_Model_cost |
|---|---|---|---|---|---|
| 0 | 0.9717 | 0.720293 | 0.751908 | 0.735761 | 0.647082 |
# Fit the bagging estimator model
bagging_estimator = BaggingClassifier(random_state=1)
bagging_estimator.fit(X_train, y_train)
BaggingClassifier(random_state=1)
# Check the metrics for the bagging estimator on the training dataset
bagging_classifier_perf_train = model_performance_classification_sklearn(
bagging_estimator, X_train, y_train
)
print(bagging_classifier_perf_train)
Accuracy  Recall  Precision  F1  Minimum_Vs_Model_cost
0  0.996867  0.943293  0.999354  0.970514  0.913479
# Check the metrics for the bagging estimator on the validation dataset
bagging_classifier_perf_val = model_performance_classification_sklearn(
bagging_estimator, X_val, y_val
)
print(bagging_classifier_perf_val)
Accuracy  Recall  Precision  F1  Minimum_Vs_Model_cost
0  0.9839  0.736746  0.959524  0.833506  0.690076
# Check the confusion matrix for bagging estimator
confusion_matrix_sklearn(bagging_estimator, X_val, y_val)
# Train the random forest classifier
rf_estimator = RandomForestClassifier(random_state=1)
rf_estimator.fit(X_train, y_train)
RandomForestClassifier(random_state=1)
# Check the metrics for the random forest classifier on the training dataset
rf_perf_train = model_performance_classification_sklearn(rf_estimator, X_train, y_train)
print(rf_perf_train)
Accuracy  Recall  Precision  F1  Minimum_Vs_Model_cost
0  1.0  1.0  1.0  1.0  1.0
# Check the metrics for the random forest classifier on the validation dataset
rf_perf_val = model_performance_classification_sklearn(rf_estimator, X_val, y_val)
print(rf_perf_val)
Accuracy  Recall  Precision  F1  Minimum_Vs_Model_cost
0  0.9869  0.767824  0.990566  0.865088  0.719737
# Check the confusion matrix for random forest classifier
confusion_matrix_sklearn(rf_estimator, X_val, y_val)
abc = AdaBoostClassifier(random_state=1)
abc.fit(X_train, y_train)
AdaBoostClassifier(random_state=1)
# Check the AdaBoost classifier metrics on training dataset
abc_perf_train = model_performance_classification_sklearn(abc, X_train, y_train)
print(abc_perf_train)
Accuracy  Recall  Precision  F1  Minimum_Vs_Model_cost
0  0.975867  0.637195  0.890119  0.742715  0.613161
# Check the AdaBoost classifier metrics on validation dataset
abc_perf_val = model_performance_classification_sklearn(abc, X_val, y_val)
print(abc_perf_val)
Accuracy  Recall  Precision  F1  Minimum_Vs_Model_cost
0  0.9729  0.61426  0.848485  0.712619  0.595428
# Check the confusion matrix for AdaBoost Classifier
confusion_matrix_sklearn(abc, X_val, y_val)
gbc = GradientBoostingClassifier(random_state=1)
gbc.fit(X_train, y_train)
GradientBoostingClassifier(random_state=1)
# Check the GB metrics for training dataset
gbc_perf_train = model_performance_classification_sklearn(gbc, X_train, y_train)
print(gbc_perf_train)
Accuracy  Recall  Precision  F1  Minimum_Vs_Model_cost
0  0.987133  0.779878  0.980828  0.868886  0.728889
# Check the GB metrics for validation dataset
gbc_perf_val = model_performance_classification_sklearn(gbc, X_val, y_val)
print(gbc_perf_val)
Accuracy  Recall  Precision  F1  Minimum_Vs_Model_cost
0  0.9829  0.720293  0.956311  0.821689  0.67698
# Check the confusion matrix for GB
confusion_matrix_sklearn(gbc, X_val, y_val)
xgb = XGBClassifier(random_state=1, eval_metric="logloss")
xgb.fit(X_train, y_train)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, eval_metric='logloss',
gamma=0, gpu_id=-1, importance_type='gain',
interaction_constraints='', learning_rate=0.300000012,
max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
monotone_constraints='()', n_estimators=100, n_jobs=8,
num_parallel_tree=1, random_state=1, reg_alpha=0, reg_lambda=1,
scale_pos_weight=1, subsample=1, tree_method='exact',
validate_parameters=1, verbosity=None)
# Check the performance on the training dataset
xgb_perf_train = model_performance_classification_sklearn(xgb, X_train, y_train)
print(xgb_perf_train)
Accuracy  Recall  Precision  F1  Minimum_Vs_Model_cost
0  1.0  1.0  1.0  1.0  1.0
# Check the performance on the validation dataset
xgb_perf_val = model_performance_classification_sklearn(xgb, X_val, y_val)
print(xgb_perf_val)
Accuracy  Recall  Precision  F1  Minimum_Vs_Model_cost
0  0.9902  0.83181  0.986985  0.902778  0.778832
# Check the confusion matrix for XGB
confusion_matrix_sklearn(xgb, X_val, y_val)
# Synthetic Minority Over Sampling Technique
print("Before UpSampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before UpSampling, counts of label 'No': {} \n".format(sum(y_train == 0)))
sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
print("After UpSampling, counts of label 'Yes': {}".format(sum(y_train_over == 1)))
print("After UpSampling, counts of label 'No': {} \n".format(sum(y_train_over == 0)))
print("After UpSampling, the shape of train_X: {}".format(X_train_over.shape))
print("After UpSampling, the shape of train_y: {} \n".format(y_train_over.shape))
Before UpSampling, counts of label 'Yes': 1640
Before UpSampling, counts of label 'No': 28360

After UpSampling, counts of label 'Yes': 28360
After UpSampling, counts of label 'No': 28360

After UpSampling, the shape of train_X: (56720, 40)
After UpSampling, the shape of train_y: (56720,)
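SMOTE does more than duplicate rows: each synthetic failure is interpolated between a real minority point and one of its k nearest minority neighbours. A toy numpy sketch of that core idea (an illustration, not imblearn's actual implementation):

```python
import numpy as np

def smote_like_sample(minority, n_new, k=5, rng=None):
    """Toy illustration of SMOTE's core idea: interpolate each new
    point between a minority sample and a random near neighbour."""
    if rng is None:
        rng = np.random.default_rng(1)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        # Distances from point i to every minority point
        dists = np.linalg.norm(minority - minority[i], axis=1)
        neighbours = np.argsort(dists)[1 : k + 1]  # skip the point itself
        j = rng.choice(neighbours)
        gap = rng.random()                         # position along the segment
        synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(synthetic)

minority = np.array(
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.2, 0.8]]
)
new_points = smote_like_sample(minority, n_new=4)
print(new_points.shape)  # (4, 2)
```

Because every synthetic point lies on a segment between two real minority points, the new samples stay inside the region the minority class already occupies.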
### Logistic Regression with oversampled data
lr_over = LogisticRegression(random_state=1)
lr_over.fit(X_train_over, y_train_over)
LogisticRegression(random_state=1)
# Calculating different metrics on train set
log_reg_model_train_perf_over = model_performance_classification_sklearn(
lr_over, X_train_over, y_train_over
)
print("Training performance:")
log_reg_model_train_perf_over
Training performance:
| | Accuracy | Recall | Precision | F1 | Minimum_Vs_Model_cost |
|---|---|---|---|---|---|
| 0 | 0.874418 | 0.875529 | 0.873588 | 0.874558 | 0.800203 |
# Calculating different metrics on validation set
log_reg_model_val_perf_over = model_performance_classification_sklearn(
lr_over, X_val, y_val
)
print("Validation performance:")
log_reg_model_val_perf_over
Validation performance:
| | Accuracy | Recall | Precision | F1 | Minimum_Vs_Model_cost |
|---|---|---|---|---|---|
| 0 | 0.873 | 0.839122 | 0.279707 | 0.419561 | 0.502911 |
# creating confusion matrix
confusion_matrix_sklearn(lr_over, X_val, y_val)
# Fit the decision tree on the oversampled data
model = DecisionTreeClassifier(
criterion="gini", class_weight="balanced", random_state=1
)
model.fit(X_train_over, y_train_over)
DecisionTreeClassifier(class_weight='balanced', random_state=1)
# Check the performance on training dataset
decision_tree_perf_train_over = model_performance_classification_sklearn(
model, X_train_over, y_train_over
)
decision_tree_perf_train_over
| | Accuracy | Recall | Precision | F1 | Minimum_Vs_Model_cost |
|---|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
# Check the performance on validation dataset
decision_tree_perf_val_over = model_performance_classification_sklearn(
model, X_val, y_val
)
decision_tree_perf_val_over
| | Accuracy | Recall | Precision | F1 | Minimum_Vs_Model_cost |
|---|---|---|---|---|---|
| 0 | 0.9511 | 0.813528 | 0.534856 | 0.645395 | 0.646572 |
# Fit the bagging estimator model
bagging_estimator_over = BaggingClassifier(random_state=1)
bagging_estimator_over.fit(X_train_over, y_train_over)
BaggingClassifier(random_state=1)
# Check the metrics for the bagging estimator on the training dataset
bagging_classifier_perf_train_over = model_performance_classification_sklearn(
bagging_estimator_over, X_train_over, y_train_over
)
print(bagging_classifier_perf_train_over)
Accuracy  Recall  Precision  F1  Minimum_Vs_Model_cost
0  0.998907  0.998237  0.999576  0.998906  0.99693
# Check the metrics for the bagging estimator on the validation dataset
bagging_classifier_perf_val_over = model_performance_classification_sklearn(
bagging_estimator_over, X_val, y_val
)
print(bagging_classifier_perf_val_over)
Accuracy  Recall  Precision  F1  Minimum_Vs_Model_cost
0  0.9839  0.837294  0.864151  0.850511  0.760426
# Check the confusion matrix for bagging estimator
confusion_matrix_sklearn(bagging_estimator_over, X_val, y_val)
# Train the random forest classifier
rf_estimator_over = RandomForestClassifier(random_state=1)
rf_estimator_over.fit(X_train_over, y_train_over)
RandomForestClassifier(random_state=1)
# Check the metrics for the random forest classifier on the training dataset
rf_perf_train_over = model_performance_classification_sklearn(
rf_estimator_over, X_train_over, y_train_over
)
print(rf_perf_train_over)
Accuracy  Recall  Precision  F1  Minimum_Vs_Model_cost
0  1.0  1.0  1.0  1.0  1.0
# Check the metrics for the random forest classifier on the validation dataset
rf_perf_val_over = model_performance_classification_sklearn(
rf_estimator_over, X_val, y_val
)
print(rf_perf_val_over)
Accuracy  Recall  Precision  F1  Minimum_Vs_Model_cost
0  0.9908  0.862888  0.965235  0.911197  0.807182
# Check the confusion matrix for random forest classifier
confusion_matrix_sklearn(rf_estimator_over, X_val, y_val)
abc_over = AdaBoostClassifier(random_state=1)
abc_over.fit(X_train_over, y_train_over)
AdaBoostClassifier(random_state=1)
# Check the AdaBoost classifier metrics on training dataset
abc_perf_train_over = model_performance_classification_sklearn(
abc_over, X_train_over, y_train_over
)
print(abc_perf_train_over)
Accuracy  Recall  Precision  F1  Minimum_Vs_Model_cost
0  0.904672  0.893794  0.913672  0.903624  0.829765
# Check the AdaBoost classifier metrics on validation dataset
abc_perf_val_over = model_performance_classification_sklearn(abc_over, X_val, y_val)
print(abc_perf_val_over)
Accuracy  Recall  Precision  F1  Minimum_Vs_Model_cost
0  0.9055  0.850091  0.350151  0.496  0.563143
# Check the confusion matrix for ABC
confusion_matrix_sklearn(abc_over, X_val, y_val)
gbc_over = GradientBoostingClassifier(random_state=1)
gbc_over.fit(X_train_over, y_train_over)
GradientBoostingClassifier(random_state=1)
# Check the GB metrics for training dataset
gbc_perf_train_over = model_performance_classification_sklearn(
gbc_over, X_train_over, y_train_over
)
print(gbc_perf_train_over)
Accuracy  Recall  Precision  F1  Minimum_Vs_Model_cost
0  0.944164  0.915867  0.970809  0.942538  0.870019
# Check the GB metrics for validation dataset
gbc_perf_val_over = model_performance_classification_sklearn(gbc_over, X_val, y_val)
print(gbc_perf_val_over)
Accuracy  Recall  Precision  F1  Minimum_Vs_Model_cost
0  0.9657  0.88117  0.634211  0.737567  0.731283
# Check the confusion matrix for GB
confusion_matrix_sklearn(gbc_over, X_val, y_val)
xgb_over = XGBClassifier(random_state=1, eval_metric="logloss")
xgb_over.fit(X_train_over, y_train_over)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, eval_metric='logloss',
gamma=0, gpu_id=-1, importance_type='gain',
interaction_constraints='', learning_rate=0.300000012,
max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
monotone_constraints='()', n_estimators=100, n_jobs=8,
num_parallel_tree=1, random_state=1, reg_alpha=0, reg_lambda=1,
scale_pos_weight=1, subsample=1, tree_method='exact',
validate_parameters=1, verbosity=None)
# Check the performance on the training dataset
xgb_perf_train_over = model_performance_classification_sklearn(
xgb_over, X_train_over, y_train_over
)
print(xgb_perf_train_over)
Accuracy  Recall  Precision  F1  Minimum_Vs_Model_cost
0  0.998713  0.998061  0.999364  0.998712  0.996568
# Check the performance on the validation dataset
xgb_perf_val_over = model_performance_classification_sklearn(xgb_over, X_val, y_val)
print(xgb_perf_val_over)
Accuracy  Recall  Precision  F1  Minimum_Vs_Model_cost
0  0.9882  0.879342  0.902439  0.890741  0.811172
# Check the confusion matrix for XGB
confusion_matrix_sklearn(xgb_over, X_val, y_val)
# Random undersampler for under sampling the data
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
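Random undersampling keeps every failure row and discards majority rows until the classes balance, so the resulting training set shrinks to 1640 + 1640 rows. The mechanism in plain numpy (illustrative toy data, not the sensor set):

```python
import numpy as np

# Toy labels with heavy imbalance (95 "no failure" vs 5 "failure").
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = np.array([0] * 95 + [1] * 5)

# Keep every minority row; randomly sample the majority down to match.
minority_idx = np.where(y == 1)[0]
majority_idx = rng.choice(np.where(y == 0)[0], size=len(minority_idx), replace=False)
keep = np.concatenate([majority_idx, minority_idx])

X_un, y_un = X[keep], y[keep]
print(np.bincount(y_un))  # [5 5]
```

The trade-off versus oversampling: training is much faster on the smaller set, but most of the majority-class information is thrown away.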
### Logistic Regression with undersampled data
lr_un = LogisticRegression(random_state=1)
lr_un.fit(X_train_un, y_train_un)
LogisticRegression(random_state=1)
# Calculating different metrics on train set
log_reg_model_train_perf_un = model_performance_classification_sklearn(
lr_un, X_train_un, y_train_un
)
print("Training performance:")
log_reg_model_train_perf_un
Training performance:
| | Accuracy | Recall | Precision | F1 | Minimum_Vs_Model_cost |
|---|---|---|---|---|---|
| 0 | 0.859451 | 0.855488 | 0.862323 | 0.858892 | 0.777374 |
# Calculating different metrics on validation set
log_reg_model_val_perf_un = model_performance_classification_sklearn(
lr_un, X_val, y_val
)
print("Validation performance:")
log_reg_model_val_perf_un
Validation performance:
| | Accuracy | Recall | Precision | F1 | Minimum_Vs_Model_cost |
|---|---|---|---|---|---|
| 0 | 0.8639 | 0.846435 | 0.266092 | 0.404897 | 0.491612 |
# creating confusion matrix
confusion_matrix_sklearn(lr_un, X_val, y_val)
### Decision Tree - Undersampled
# Fit the decision tree on the undersampled data
model = DecisionTreeClassifier(
criterion="gini", class_weight="balanced", random_state=1
)
model.fit(X_train_un, y_train_un)
DecisionTreeClassifier(class_weight='balanced', random_state=1)
# Check the performance on training dataset
decision_tree_perf_train_un = model_performance_classification_sklearn(
model, X_train_un, y_train_un
)
decision_tree_perf_train_un
| | Accuracy | Recall | Precision | F1 | Minimum_Vs_Model_cost |
|---|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
# Check the performance on validation dataset
decision_tree_perf_val_un = model_performance_classification_sklearn(
model, X_val, y_val
)
decision_tree_perf_val_un
| | Accuracy | Recall | Precision | F1 | Minimum_Vs_Model_cost |
|---|---|---|---|---|---|
| 0 | 0.8664 | 0.853748 | 0.271039 | 0.411454 | 0.497725 |
# Fit the bagging estimator model
bagging_estimator_un = BaggingClassifier(random_state=1)
bagging_estimator_un.fit(X_train_un, y_train_un)
BaggingClassifier(random_state=1)
# Check the metrics for the bagging estimator on the training dataset
bagging_classifier_perf_train_un = model_performance_classification_sklearn(
bagging_estimator_un, X_train_un, y_train_un
)
print(bagging_classifier_perf_train_un)
Accuracy  Recall  Precision  F1  Minimum_Vs_Model_cost
0  0.989024  0.979878  0.998137  0.988923  0.966981
# Check the metrics for the bagging estimator on the validation dataset
bagging_classifier_perf_val_un = model_performance_classification_sklearn(
bagging_estimator_un, X_val, y_val
)
print(bagging_classifier_perf_val_un)
Accuracy  Recall  Precision  F1  Minimum_Vs_Model_cost
0  0.9505  0.862888  0.529148  0.656011  0.673645
# Check the confusion matrix for bagging estimator
confusion_matrix_sklearn(bagging_estimator_un, X_val, y_val)
# Train the random forest classifier
rf_estimator_un = RandomForestClassifier(random_state=1)
rf_estimator_un.fit(X_train_un, y_train_un)
RandomForestClassifier(random_state=1)
# Check the metrics for the random forest classifier on the training dataset
rf_perf_train_un = model_performance_classification_sklearn(
rf_estimator_un, X_train_un, y_train_un
)
print(rf_perf_train_un)
Accuracy  Recall  Precision  F1  Minimum_Vs_Model_cost
0  1.0  1.0  1.0  1.0  1.0
# Check the metrics for the random forest classifier on the validation dataset
rf_perf_val_un = model_performance_classification_sklearn(rf_estimator_un, X_val, y_val)
print(rf_perf_val_un)
Accuracy  Recall  Precision  F1  Minimum_Vs_Model_cost
0  0.9662  0.884826  0.637681  0.741194  0.735545
# Check the confusion matrix for random forest classifier
confusion_matrix_sklearn(rf_estimator_un, X_val, y_val)
abc_un = AdaBoostClassifier(random_state=1)
abc_un.fit(X_train_un, y_train_un)
AdaBoostClassifier(random_state=1)
# Check the AdaBoost classifier metrics on training dataset
abc_perf_train_un = model_performance_classification_sklearn(
abc_un, X_train_un, y_train_un
)
print(abc_perf_train_un)
Accuracy  Recall  Precision  F1  Minimum_Vs_Model_cost
0  0.906098  0.893902  0.91625  0.904938  0.83052
# Check the AdaBoost classifier metrics on validation dataset
abc_perf_val_un = model_performance_classification_sklearn(abc_un, X_val, y_val)
print(abc_perf_val_un)
Accuracy  Recall  Precision  F1  Minimum_Vs_Model_cost
0  0.8797  0.864717  0.295256  0.440205  0.522611
# Check the confusion matrix for ABC
confusion_matrix_sklearn(abc_un, X_val, y_val)
gbc_un = GradientBoostingClassifier(random_state=1)
gbc_un.fit(X_train_un, y_train_un)
GradientBoostingClassifier(random_state=1)
# Check the GB metrics for training dataset
gbc_perf_train_un = model_performance_classification_sklearn(
gbc_un, X_train_un, y_train_un
)
print(gbc_perf_train_un)
Accuracy  Recall  Precision  F1  Minimum_Vs_Model_cost
0  0.952134  0.918293  0.984957  0.950458  0.876537
# Check the GB metrics for validation dataset
gbc_perf_val_un = model_performance_classification_sklearn(gbc_un, X_val, y_val)
print(gbc_perf_val_un)
Accuracy  Recall  Precision  F1  Minimum_Vs_Model_cost
0  0.9513  0.888483  0.532895  0.66621  0.691821
# Check the confusion matrix for GB
confusion_matrix_sklearn(gbc_un, X_val, y_val)
xgb_un = XGBClassifier(random_state=1, eval_metric="logloss")
xgb_un.fit(X_train_un, y_train_un)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, eval_metric='logloss',
gamma=0, gpu_id=-1, importance_type='gain',
interaction_constraints='', learning_rate=0.300000012,
max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
monotone_constraints='()', n_estimators=100, n_jobs=8,
num_parallel_tree=1, random_state=1, reg_alpha=0, reg_lambda=1,
scale_pos_weight=1, subsample=1, tree_method='exact',
validate_parameters=1, verbosity=None)
# Check the performance on the training dataset
xgb_perf_train_un = model_performance_classification_sklearn(
xgb_un, X_train_un, y_train_un
)
print(xgb_perf_train_un)
Accuracy  Recall  Precision  F1  Minimum_Vs_Model_cost
0  1.0  1.0  1.0  1.0  1.0
# Check the performance on the validation dataset
xgb_perf_val_un = model_performance_classification_sklearn(xgb_un, X_val, y_val)
print(xgb_perf_val_un)
Accuracy  Recall  Precision  F1  Minimum_Vs_Model_cost
0  0.9687  0.897623  0.656417  0.758301  0.753444
# Check the confusion matrix for XGB
confusion_matrix_sklearn(xgb_un, X_val, y_val)
# Comparing the metric on the default (unsampled) dataset
models_comp_df = pd.concat(
[
log_reg_model_train_perf.Minimum_Vs_Model_cost,
log_reg_model_val_perf.Minimum_Vs_Model_cost,
decision_tree_perf_train.Minimum_Vs_Model_cost,
decision_tree_perf_val.Minimum_Vs_Model_cost,
bagging_classifier_perf_train.Minimum_Vs_Model_cost,
bagging_classifier_perf_val.Minimum_Vs_Model_cost,
rf_perf_train.Minimum_Vs_Model_cost,
rf_perf_val.Minimum_Vs_Model_cost,
abc_perf_train.Minimum_Vs_Model_cost,
abc_perf_val.Minimum_Vs_Model_cost,
gbc_perf_train.Minimum_Vs_Model_cost,
gbc_perf_val.Minimum_Vs_Model_cost,
xgb_perf_train.Minimum_Vs_Model_cost,
xgb_perf_val.Minimum_Vs_Model_cost,
],
axis=1,
)
models_comp_df.columns = [
"LG Train",
"LG Validation",
"DT Train",
"DT Validation",
"BG Train",
"BG Validation",
"Random Forest Train",
"Random Forest Validation",
"AdaBoost - Train",
"AdaBoost - Validation",
"GBoost - Train",
"GBoost - Validation",
"XG Boost - Train",
"XG Boost - Validation",
]
print("Comparison of all models for default dataset:")
models_comp_df
Comparison of all models for default dataset:
| | LG Train | LG Validation | DT Train | DT Validation | BG Train | BG Validation | Random Forest Train | Random Forest Validation | AdaBoost - Train | AdaBoost - Validation | GBoost - Train | GBoost - Validation | XG Boost - Train | XG Boost - Validation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.53063 | 0.519962 | 1.0 | 0.647082 | 0.913479 | 0.690076 | 1.0 | 0.719737 | 0.613161 | 0.595428 | 0.728889 | 0.67698 | 1.0 | 0.778832 |
# Comparing the metric on the oversampled dataset
models_comp_over = pd.concat(
[
log_reg_model_train_perf_over.Minimum_Vs_Model_cost,
log_reg_model_val_perf_over.Minimum_Vs_Model_cost,
decision_tree_perf_train_over.Minimum_Vs_Model_cost,
decision_tree_perf_val_over.Minimum_Vs_Model_cost,
bagging_classifier_perf_train_over.Minimum_Vs_Model_cost,
bagging_classifier_perf_val_over.Minimum_Vs_Model_cost,
rf_perf_train_over.Minimum_Vs_Model_cost,
rf_perf_val_over.Minimum_Vs_Model_cost,
abc_perf_train_over.Minimum_Vs_Model_cost,
abc_perf_val_over.Minimum_Vs_Model_cost,
gbc_perf_train_over.Minimum_Vs_Model_cost,
gbc_perf_val_over.Minimum_Vs_Model_cost,
xgb_perf_train_over.Minimum_Vs_Model_cost,
xgb_perf_val_over.Minimum_Vs_Model_cost,
],
axis=1,
)
models_comp_over.columns = [
"LG Train",
"LG Validation",
"DT Train",
"DT Validation",
"BG Train",
"BG Validation",
"Random Forest Train",
"Random Forest Validation",
"AdaBoost - Train",
"AdaBoost - Validation",
"GBoost - Train",
"GBoost - Validation",
"XG Boost - Train",
"XG Boost - Validation",
]
print("Comparison of all models for oversampled dataset:")
models_comp_over
Comparison of all models for oversampled dataset:
| | LG Train | LG Validation | DT Train | DT Validation | BG Train | BG Validation | Random Forest Train | Random Forest Validation | AdaBoost - Train | AdaBoost - Validation | GBoost - Train | GBoost - Validation | XG Boost - Train | XG Boost - Validation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.800203 | 0.502911 | 1.0 | 0.646572 | 0.99693 | 0.760426 | 1.0 | 0.807182 | 0.829765 | 0.563143 | 0.870019 | 0.731283 | 0.996568 | 0.811172 |
# Comparing the metric on the undersampled dataset
models_comp_un = pd.concat(
[
log_reg_model_train_perf_un.Minimum_Vs_Model_cost,
log_reg_model_val_perf_un.Minimum_Vs_Model_cost,
decision_tree_perf_train_un.Minimum_Vs_Model_cost,
decision_tree_perf_val_un.Minimum_Vs_Model_cost,
bagging_classifier_perf_train_un.Minimum_Vs_Model_cost,
bagging_classifier_perf_val_un.Minimum_Vs_Model_cost,
rf_perf_train_un.Minimum_Vs_Model_cost,
rf_perf_val_un.Minimum_Vs_Model_cost,
abc_perf_train_un.Minimum_Vs_Model_cost,
abc_perf_val_un.Minimum_Vs_Model_cost,
gbc_perf_train_un.Minimum_Vs_Model_cost,
gbc_perf_val_un.Minimum_Vs_Model_cost,
xgb_perf_train_un.Minimum_Vs_Model_cost,
xgb_perf_val_un.Minimum_Vs_Model_cost,
],
axis=1,
)
models_comp_un.columns = [
"LG Train",
"LG Validation",
"DT Train",
"DT Validation",
"BG Train",
"BG Validation",
"Random Forest Train",
"Random Forest Validation",
"AdaBoost - Train",
"AdaBoost - Validation",
"GBoost - Train",
"GBoost - Validation",
"XG Boost - Train",
"XG Boost - Validation",
]
print("Comparison of all models for undersampled dataset:")
models_comp_un
Comparison of all models for undersampled dataset:
| | LG Train | LG Validation | DT Train | DT Validation | BG Train | BG Validation | Random Forest Train | Random Forest Validation | AdaBoost - Train | AdaBoost - Validation | GBoost - Train | GBoost - Validation | XG Boost - Train | XG Boost - Validation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.777374 | 0.491612 | 1.0 | 0.497725 | 0.966981 | 0.673645 | 1.0 | 0.735545 | 0.83052 | 0.522611 | 0.876537 | 0.691821 | 1.0 | 0.753444 |
# Combine all the models along with their data sampling methods
frames = [models_comp_df, models_comp_over, models_comp_un]
combined = pd.concat(frames)
print(combined)
# Write the models & sampling methods to a csv file
# combined.to_csv("out.csv")
| | LG Train | LG Validation | DT Train | DT Validation | BG Train | BG Validation | Random Forest Train | Random Forest Validation | AdaBoost - Train | AdaBoost - Validation | GBoost - Train | GBoost - Validation | XG Boost - Train | XG Boost - Validation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Default | 0.530630 | 0.519962 | 1.0 | 0.647082 | 0.913479 | 0.690076 | 1.0 | 0.719737 | 0.613161 | 0.595428 | 0.728889 | 0.676980 | 1.000000 | 0.778832 |
| Oversampled | 0.800203 | 0.502911 | 1.0 | 0.646572 | 0.996930 | 0.760426 | 1.0 | 0.807182 | 0.829765 | 0.563143 | 0.870019 | 0.731283 | 0.996568 | 0.811172 |
| Undersampled | 0.777374 | 0.491612 | 1.0 | 0.497725 | 0.966981 | 0.673645 | 1.0 | 0.735545 | 0.830520 | 0.522611 | 0.876537 | 0.691821 | 1.000000 | 0.753444 |
Comparing all the models and data sampling methods, I have arrived at the following conclusions:
The XGBoost model with the oversampled data scores 0.81 on the target metric for the validation set, with 0.99 on the training set. There is a good chance that hyperparameter tuning will yield a better-performing model, so this is the first model I am picking for tuning.
The random forest model with the oversampled data scores 0.80 on the target metric for the validation set, with 1.0 on the training set. As we need to maximize the target metric, I am choosing this as the second model for tuning.
The gradient boosting model with the oversampled data scores 0.73 on the target metric for the validation set, with 0.87 on the training set. As we need to maximize the target metric, I am choosing this as the third model for tuning.
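For reference, the `Minimum_Vs_Model_cost` metric used throughout compares the minimum achievable maintenance cost (every failure caught and repaired in time) to the cost the model's predictions would incur. The cost figures below (inspect = 5, repair = 15, replace = 40) are illustrative assumptions, not the notebook's actual values:

```python
def minimum_vs_model_cost(tn, fp, fn, tp, inspect=5, repair=15, replace=40):
    """Sketch of the cost ratio: 1.0 means the model achieves the
    minimum possible cost; missed failures (FN) drag the ratio down.
    The three cost constants are assumptions for illustration."""
    # False positives trigger inspections, true positives trigger
    # repairs, and missed failures force full replacements.
    model_cost = fp * inspect + tp * repair + fn * replace
    # Minimum possible cost: every true failure repaired before breaking.
    minimum_cost = (tp + fn) * repair
    return minimum_cost / model_cost

print(minimum_vs_model_cost(tn=9000, fp=0, fn=0, tp=1000))              # 1.0
print(round(minimum_vs_model_cost(tn=8950, fp=50, fn=100, tp=900), 3))  # 0.845
```

This is why recall is the tuning target: under any costs where replacement dominates, false negatives hurt the ratio far more than false positives do.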
# Candidate hyperparameter grids considered for tuning (one per model)
param_grid = {"n_estimators": np.arange(150, 300, 50), "scale_pos_weight": [5, 10], "learning_rate": [0.1, 0.2], "gamma": [0, 3, 5], "subsample": [0.8, 0.9]}
param_grid = {"init": [AdaBoostClassifier(random_state=1), DecisionTreeClassifier(random_state=1)], "n_estimators": np.arange(75, 150, 25), "learning_rate": [0.2, 0.05, 1], "subsample": [0.5, 0.7], "max_features": [0.5, 0.7]}
param_grid = {"n_estimators": np.arange(10, 110, 20), "learning_rate": [0.2, 0.05, 1], "base_estimator": [DecisionTreeClassifier(max_depth=1, random_state=1), DecisionTreeClassifier(max_depth=2, random_state=1), DecisionTreeClassifier(max_depth=3, random_state=1)]}
param_grid = {"C": np.arange(0.1, 1.1, 0.1)}
param_grid = {"max_samples": [0.8, 0.9], "max_features": [0.8, 0.9], "n_estimators": [40, 50]}
param_grid = {"n_estimators": [150, 250], "min_samples_leaf": np.arange(1, 3), "max_features": ["sqrt", "log2"], "max_samples": np.arange(0.2, 0.6, 0.1)}
param_grid = {"max_depth": np.arange(2, 20), "min_samples_leaf": [1, 2, 5, 7], "max_leaf_nodes": [5, 10, 15], "min_impurity_decrease": [0.0001, 0.001]}
%%time
# Choose the type of classifier.
# In addition to the parameters recommended by Great Learning, I have used max_depth and
# min_child_weight to improve the model performance.
# The other parameter values are as recommended by Great Learning.
xgb_tuned = XGBClassifier(random_state=1, eval_metric="logloss")
param_grid = {
"n_estimators": np.arange(150, 300, 50),
"scale_pos_weight": [5, 10],
"learning_rate": [0.1, 0.2],
"gamma": [0, 3, 5],
"subsample": [0.8, 0.9],
"max_depth": np.arange(7, 11, 2), # added parameter
"min_child_weight": np.arange(8, 9, 1), # added parameter
}
# Type of scoring used to compare parameter combinations
recall_scorer = metrics.make_scorer(metrics.recall_score)
# Run the randomized search
grid_obj = RandomizedSearchCV(xgb_tuned, param_grid, scoring=recall_scorer, cv=5)
grid_obj = grid_obj.fit(X_train_over, y_train_over)
# Set the clf to the best combination of parameters
xgb_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
xgb_tuned.fit(X_train_over, y_train_over)
Wall time: 27min 40s
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, eval_metric='logloss',
gamma=5, gpu_id=-1, importance_type='gain',
interaction_constraints='', learning_rate=0.1, max_delta_step=0,
max_depth=7, min_child_weight=8, missing=nan,
monotone_constraints='()', n_estimators=200, n_jobs=8,
num_parallel_tree=1, random_state=1, reg_alpha=0, reg_lambda=1,
scale_pos_weight=10, subsample=0.9, tree_method='exact',
validate_parameters=1, verbosity=None)
# Check the performance on the training dataset
xgb_perf_tuned_train_over = model_performance_classification_sklearn(
xgb_tuned, X_train_over, y_train_over
)
print(xgb_perf_tuned_train_over)
Accuracy  Recall  Precision  F1  Minimum_Vs_Model_cost
0  0.994411  1.0  0.988946  0.994442  0.996288
# Check the performance on the validation dataset
xgb_perf_tuned_val_over = model_performance_classification_sklearn(xgb_tuned, X_val, y_val)
print(xgb_perf_tuned_val_over)
Accuracy  Recall  Precision  F1  Minimum_Vs_Model_cost
0  0.9741  0.903108  0.705714  0.792302  0.776989
# Check the confusion matrix for XGB
confusion_matrix_sklearn(xgb_tuned, X_val, y_val)
# Separate the test dataset into predictor variables (drop Target) and the target variable
X_test = renew_test.drop(["Target"], axis=1)
y_test = renew_test["Target"]
# Check the performance on the test dataset
# The tuned model is expected to exceed 0.78 on the Minimum_Vs_Model_cost metric for the test dataset
xgb_perf_tuned_test_over = model_performance_classification_sklearn(
xgb_tuned, X_test, y_test
)
print(xgb_perf_tuned_test_over)
Accuracy  Recall  Precision  F1  Minimum_Vs_Model_cost
0  0.9713  0.882998  0.684136  0.77095  0.751374
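All the models above classify with the default 0.5 probability threshold. Since a missed failure (replacement) costs far more than a false alarm (inspection), lowering the threshold is another lever worth exploring; a self-contained sketch on synthetic imbalanced data (not the sensor set):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

# Imbalanced synthetic data: ~5% positives, like the failure class.
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.95], random_state=1)
model = LogisticRegression(random_state=1).fit(X, y)
proba = model.predict_proba(X)[:, 1]

# Lowering the threshold flags more rows as failures: recall rises
# (fewer costly misses) while precision falls (more inspections).
for threshold in (0.5, 0.3, 0.1):
    preds = (proba >= threshold).astype(int)
    print(threshold, round(recall_score(y, preds), 3), round(precision_score(y, preds), 3))
```

The best threshold would be the one minimizing the expected maintenance cost on the validation set, not necessarily 0.5.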
# Choose the type of classifier.
rf_estimator_tuned = RandomForestClassifier(random_state=1, class_weight="balanced")
%%time
# For Random Forest:
# No changes to the parameters; they are as recommended by Great Learning
param_grid = {
"n_estimators": [150, 250],
"min_samples_leaf": np.arange(1, 3),
"max_features": ["sqrt", "log2"],
"max_samples": np.arange(0.2, 0.6, 0.1),
}
# Type of scoring used to compare parameter combinations
recall_scorer = metrics.make_scorer(metrics.recall_score)
# Run the randomized search
grid_obj = RandomizedSearchCV(rf_estimator_tuned, param_grid, scoring=recall_scorer, cv=5)
grid_obj = grid_obj.fit(X_train_over, y_train_over)
# Set the clf to the best combination of parameters
rf_estimator_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
rf_estimator_tuned.fit(X_train_over, y_train_over)
Wall time: 18min 25s
RandomForestClassifier(class_weight='balanced', max_features='sqrt',
max_samples=0.5000000000000001, n_estimators=250,
random_state=1)
# Check the performance on the training dataset
rf_estimator_perf_tuned_train_over = model_performance_classification_sklearn(
rf_estimator_tuned, X_train_over, y_train_over
)
print(rf_estimator_perf_tuned_train_over)
Accuracy  Recall  Precision  F1  Minimum_Vs_Model_cost
0  0.998819  0.99792  0.999717  0.998818  0.996451
# Check the performance on the validation dataset
rf_estimator_perf_tuned_val_over = model_performance_classification_sklearn(rf_estimator_tuned, X_val, y_val)
print(rf_estimator_perf_tuned_val_over)
Accuracy  Recall  Precision  F1  Minimum_Vs_Model_cost
0  0.9902  0.872029  0.944554  0.906844  0.812779
# Check the confusion matrix for Random Forest tuned model
confusion_matrix_sklearn(rf_estimator_tuned, X_val, y_val)
# Check the performance on the test dataset
# The tuned model is expected to exceed 0.78 on the Minimum_Vs_Model_cost metric for the test dataset
rf_estimator_tuned_test_over = model_performance_classification_sklearn(
rf_estimator_tuned, X_test, y_test
)
print(rf_estimator_tuned_test_over)
Accuracy  Recall  Precision  F1  Minimum_Vs_Model_cost
0  0.9896  0.859232  0.945674  0.900383  0.799318
# Choose the type of classifier.
gbc_tuned = GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1))
%%time
# For Gradient Boosting:
param_grid = {
"init": [AdaBoostClassifier(random_state=1), DecisionTreeClassifier(random_state=1)],
"n_estimators": np.arange(75, 150, 25),
"learning_rate": [0.2, 0.05, 1],
"subsample": [0.5, 0.7],
"max_features": [0.5, 0.7],
}
# Type of scoring used to compare parameter combinations
recall_scorer = metrics.make_scorer(metrics.recall_score)
# Run the randomized search
grid_obj = RandomizedSearchCV(gbc_tuned, param_grid, scoring=recall_scorer, cv=5)
grid_obj = grid_obj.fit(X_train_over, y_train_over)
# Set the clf to the best combination of parameters
gbc_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
gbc_tuned.fit(X_train_over, y_train_over)
Wall time: 29min
GradientBoostingClassifier(init=DecisionTreeClassifier(random_state=1),
learning_rate=0.05, max_features=0.7,
n_estimators=125, subsample=0.7)
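The `init` parameter selected here warm-starts gradient boosting from another estimator's predictions instead of the default constant prior. A small self-contained illustration on synthetic data follows; the sample sizes, `max_depth`, and `n_estimators` are toy values for speed, not the tuned settings above.

```python
# Sketch: gradient boosting initialized from a decision tree's predictions.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=1)

gbc = GradientBoostingClassifier(
    # First-stage model: boosting stages then correct its residual errors
    init=DecisionTreeClassifier(max_depth=3, random_state=1),
    learning_rate=0.05,
    n_estimators=50,
    subsample=0.7,
    max_features=0.7,
    random_state=1,
)
gbc.fit(X, y)
```

Note that a fully grown tree as `init` can drive training metrics to 1.0 (as seen below) while generalizing poorly, which is one plausible reason the validation scores drop so sharply.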
# Check the performance on the training dataset
gbc_perf_tuned_train_over = model_performance_classification_sklearn(
gbc_tuned, X_train_over, y_train_over
)
print(gbc_perf_tuned_train_over)
   Accuracy  Recall  Precision   F1  Minimum_Vs_Model_cost
0       1.0     1.0        1.0  1.0                    1.0
# Check the performance on the validation dataset
gbc_perf_tuned_val_over = model_performance_classification_sklearn(gbc_tuned, X_val, y_val)
print(gbc_perf_tuned_val_over)
   Accuracy    Recall  Precision        F1  Minimum_Vs_Model_cost
0    0.9511  0.813528   0.534856  0.645395               0.646572
# Check the confusion matrix for GB tuned model
confusion_matrix_sklearn(gbc_tuned, X_val, y_val)
# Check the performance on the test dataset
# The tuned model is expected to exceed 0.78 on the Minimum_Vs_Model_cost metric on the test dataset
gbc_tuned_test_over = model_performance_classification_sklearn(
gbc_tuned, X_test, y_test
)
print(gbc_tuned_test_over)
   Accuracy    Recall  Precision        F1  Minimum_Vs_Model_cost
0    0.9478  0.800731   0.514689  0.626609               0.631397
# Combine all the models on their test results
frames = [xgb_perf_tuned_test_over, rf_estimator_tuned_test_over, gbc_tuned_test_over]
combined = pd.concat(frames)
print(combined)
# Write the models & sampling methods to a csv file
# combined.to_csv("out1.csv")
   Accuracy    Recall  Precision        F1  Minimum_Vs_Model_cost
0    0.9713  0.882998   0.684136  0.770950               0.751374
0    0.9896  0.859232   0.945674  0.900383               0.799318
0    0.9478  0.800731   0.514689  0.626609               0.631397
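Since each one-row frame carries the index 0, the concatenated table repeats that index three times and it is hard to tell which row is which model. Passing labeled frames (here via a dict, so `pd.concat` uses the keys) fixes that; the scores below are stand-in numbers for illustration, not the notebook's actual results.

```python
# Sketch: labeling concatenated per-model result rows (stand-in scores).
import pandas as pd

frames = {
    "XGBoost tuned": pd.DataFrame({"Recall": [0.88], "Minimum_Vs_Model_cost": [0.75]}),
    "Random Forest tuned": pd.DataFrame({"Recall": [0.86], "Minimum_Vs_Model_cost": [0.80]}),
    "Gradient Boosting tuned": pd.DataFrame({"Recall": [0.80], "Minimum_Vs_Model_cost": [0.63]}),
}
# dict keys become the outer index level; drop the repeated inner 0 level
combined = pd.concat(frames).droplevel(1)
print(combined)
```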
rf_estimator_tuned_test_over = model_performance_classification_sklearn(
rf_estimator_tuned, X_test, y_test
)
print(rf_estimator_tuned_test_over)
   Accuracy    Recall  Precision        F1  Minimum_Vs_Model_cost
0    0.9896  0.859232   0.945674  0.900383               0.799318
# Check the confusion matrix for Random Forest tuned model
confusion_matrix_sklearn(rf_estimator_tuned, X_test, y_test)
importances = rf_estimator_tuned.feature_importances_
indices = np.argsort(importances)
feature_names = list(X.columns)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
# Pipelines for productionizing the model
# Now that we have a final model, let's use pipelines to put it into production
# creating a list of numerical variables
numerical_features = [
"V1",
"V2",
]
# creating a transformer for numerical variables, which will impute their missing values
# Since the median and KNN imputers gave the same result, I am going with the KNN imputer for building the pipeline
# numeric_transformer = Pipeline(steps=[("imputer", SimpleImputer(strategy="median"))])
numeric_transformer = Pipeline(steps=[("imputer",KNNImputer(n_neighbors=5))])
# combining the numerical transformer using a column transformer
preprocessor = ColumnTransformer(
transformers=[
("num", numeric_transformer, numerical_features),
],
remainder="passthrough",
)
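One behavior of `remainder="passthrough"` worth noting: the transformed columns are emitted first and the untouched columns are appended after them, so the output column order can differ from the input DataFrame's. A toy illustration with hypothetical `V1`/`V2`/`V3` data:

```python
# Sketch: ColumnTransformer imputes V1/V2 and passes V3 through untouched.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline

df = pd.DataFrame(
    {"V1": [1.0, np.nan, 3.0], "V2": [4.0, 5.0, np.nan], "V3": [7.0, 8.0, 9.0]}
)
pre = ColumnTransformer(
    transformers=[
        ("num", Pipeline(steps=[("imputer", KNNImputer(n_neighbors=2))]), ["V1", "V2"]),
    ],
    remainder="passthrough",
)
out = pre.fit_transform(df)  # imputed V1, V2 first, then passthrough V3
```

Here all columns are numeric so the order is unchanged, but with a mixed selection the passthrough block would move to the end.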
# Separating target variable and other variables
X = renew_test.drop(columns="Target")
y = renew_test["Target"]
# Creating new pipeline with best parameters
model = Pipeline(
steps=[
("pre", preprocessor),
(
"RandomForestClassifier",
RandomForestClassifier(
class_weight="balanced",
max_features="sqrt",
max_samples=0.4,
n_estimators=250,
random_state=1,
min_samples_leaf=3,
),
),
]
)
# Fit the pipeline on the data separated above
model.fit(X, y)
Pipeline(steps=[('pre',
ColumnTransformer(remainder='passthrough',
transformers=[('num',
Pipeline(steps=[('imputer',
KNNImputer())]),
['V1', 'V2'])])),
('RandomForestClassifier',
RandomForestClassifier(class_weight='balanced',
max_features='sqrt',
max_samples=0.4,
min_samples_leaf=3, n_estimators=250,
random_state=1))])
rf_estimator_tuned_test_pipeline = model_performance_classification_sklearn(model, X, y)
print(rf_estimator_tuned_test_pipeline)
   Accuracy    Recall  Precision        F1  Minimum_Vs_Model_cost
0    0.9911  0.844607   0.991416  0.912142               0.792754
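A fitted pipeline is usually persisted so production code can load it and predict in one step; `joblib` is the common choice for scikit-learn objects. The sketch below uses a toy stand-in pipeline (SimpleImputer plus a small forest, not the exact ReneWind preprocessor) and a hypothetical file name.

```python
# Sketch: persisting and restoring a fitted pipeline with joblib.
import os
import tempfile

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Toy stand-in for the ReneWind pipeline
X = pd.DataFrame({"V1": [0.1, 0.2, None, 0.4], "V2": [1.0, 0.9, 0.8, None]})
y = pd.Series([0, 1, 0, 1])
model = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("rf", RandomForestClassifier(n_estimators=10, random_state=1)),
    ]
).fit(X, y)

path = os.path.join(tempfile.mkdtemp(), "renewind_model.joblib")
joblib.dump(model, path)      # serialize the whole pipeline, imputer included
restored = joblib.load(path)  # production code would start here
```

Loading the serialized object in production guarantees that incoming sensor data goes through exactly the same imputation before scoring.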